Abstract



This paper introduces several fundamental con- 
cepts in information theory from the perspective 
of their origins in engineering. Understanding such 
concepts is important in neuroscience for two rea- 
sons. Simply applying formulae from information 
theory without understanding the assumptions be- 
hind their definitions can lead to erroneous results 
and conclusions. Furthermore, this century will 
see a convergence of information theory and neuro- 
science; information theory will expand its founda- 
tions to incorporate more comprehensively biolog- 
ical processes thereby helping reveal how neuronal 
networks achieve their remarkable information pro- 
cessing abilities. 



1 Introduction 

Norbert Wiener, the founder of cybernetics, wrote 
that it is the "boundary regions of science which 
offer the richest opportunities to the qualified investigator.
They are at the same time the most refractory
to the accepted techniques of mass attack
and the division of labor" (Wiener, 1948).
He went on to explain that "a proper exploration of 
these blank spaces on the map of science could only 
be made by a team of scientists, each a specialist 
in his own field but each possessing a thoroughly 
sound and trained acquaintance with the fields of 
his neighbors; all in the habit of working together, 
of knowing one another's intellectual customs, and 
of recognizing the significance of a colleague's new 
suggestion before it has taken on a full formal ex- 
pression. The mathematician need not have the 
skill to conduct a physiological experiment, but he
must have the skill to understand one, to criticize
one, and to suggest one. The physiologist need not 
be able to prove a certain mathematical theorem, 
but he must be able to grasp its physiological sig- 
nificance and to tell the mathematician for what 
he should look." 

Indeed, three giants of science, Wiener, von 
Neumann and Shannon, realised in the 1940s 
the need for understanding the brain in terms 
of the fundamental engineering principles applicable
to any computational device: energy, entropy
and feedback (Wiener, 1948; von Neumann, 2000;
Shannon and Weaver, 1949). This led to the Macy
conferences (1946-1953) which attracted leading 
scientists from across engineering and the physi- 
cal and life sciences. The Macy conferences were 
one of the earliest organised approaches to trans- 
disciplinarity and hailed by some as the most im- 
portant event in the history of science after World 
War II. They demonstrated the need for, and the 
initial difficulties in, establishing a common lan- 
guage powerful enough to communicate the intri- 
cacies of the relevant fields across the physical and 
life sciences and engineering. 

Their dream was not realised, primarily
because of insufficient experimental data. As
time marched on, the barrier to bringing together 
the ever more specialised disciplines grew larger. 
With tremendous experimental advances having 
been made in the past 60 years, it is timely to stand 
on the shoulders of these giants and resume their 
quest. With this as motivation, the present article 
endeavours to whet the appetites of neuroscientists 
and information theorists alike for learning more of 
each other's fields. 



1.1 Relevance of Information Theory (and Feedback)

The human brain is often described as the
most complex structure in the known universe
(Fischbach, 1992). Certainly, it is the most
efficient signal processing device known. Drawing
only 20 watts of power, the brain significantly outperforms
engineered devices at signal processing
tasks such as source separation, feature extraction,
and speech and image recognition (Sarpeshkar, 1998).
This is all the more remarkable because signals
within the brain propagate very slowly compared
with those in a computer.

This suggests that the brain uses a paradigm for 
signal processing very different from any developed 
in engineering. Why then should engineering in 
general and information theory in particular have 
relevance to understanding the brain? The answer 
lies partially in the fact that engineers study fundamental
laws pertinent to any system, including
biological ones (Berger, 2003; Sarpeshkar, 1998).
Indeed, John von Neumann viewed the brain as
a hybrid computer which performs control, communication
and computation, and concluded that
information theory is therefore essential for understanding
its functionality (von Neumann, 2000).
Wiener too recognised that information theory was
essential for a deeper understanding of feedback
and thus life (Wiener, 1948). Anecdotal evidence
suggests Shannon himself, the father of informa- 
tion theory, may have been partially motivated by 
how his brain processed "information" when per- 
forming a complex task such as juggling balls. 

A few words on the concept of feedback are in 
order. Feedback refers to achieving a task, such as 
keeping a car travelling at a constant speed, by re- 
peatedly measuring the current state, such as the 
car's speed, and feeding those measurements back 
and using them to make the requisite changes at 
the input, such as applying more or less pressure to 
the accelerator of the car. Feedback is a fundamental
concept in engineering because it can mitigate
errors caused by imprecision and external interference.

The brain too must use feedback to overcome
imprecisions (Burdet et al., 2001; Wiener, 1948;
Marko, 1967; Todorov and Jordan, 200[?];
Burdet et al., 2006; Franklin et al., 2008): without
feedback, we would fall over whenever we attempted
to walk. Within the sensory pathways
there are tremendous numbers of feedback paths 
connecting regions of higher-level brain function 
to regions of lower-level functionality, giving rise 
to top-down processing theories of the visual path- 
ways and providing a mechanism for selective at- 
tention. 

Although they started out in different disciplines
(information theory emerged from communication
theory while feedback was studied in control
theory), recent years have seen some convergence
of feedback and information theory. A
fundamental question is: what is the slowest rate
at which information must be fed back for the system
to work? Scientists have started to consider
how fast the brain must be processing information
if we are able to walk properly and can move
our hand in a straight line even though random
external forces are impeding its motion in experiments
(Burdet et al., 2006). This is an example of
such convergence of two important theories.

By virtue of being introductory, the present pa- 
per focuses on discussing information theory in 
"one-way" (i.e. feed-forward exclusively) biologi- 
cal contexts. A comprehensive account of the brain 
in engineering terms would necessarily involve the 
marriage of information theory and control theory. 
Repeating the words of von Neumann, the brain
must be understood in terms of control (feedback),
communication (information theory) and computation.

1.2 Information Theory and Communication

It must be recognised that the mathematical disci- 
pline of "information theory" does not (and should 
not) capture all aspects of how the word "informa- 
tion" is used in spoken language. Failing to distin- 
guish the two can lead to errors caused by flawed 
intuition in one direction, or the inappropriate application
of information theory in the other.^1

Information theory was invented in response to 
practical problems faced by the designers of com- 
munication systems such as telephones and data 
modems. The basic problem is to find an efficient 
way of transmitting information from one place to 
another, whether it be a probe sending information 
from the moon back to earth or a mobile telephone 
sending and receiving voice and internet packets. 
Indeed, consider the problem of one person trying 
to send a series of messages to another person on 
the other side of a brick wall; for simplicity, assume 
this second person is not allowed to speak or send 
any other form of message to the first person such 
as an acknowledgement or request for clarification 
(or "retransmission"). How should the first person
send each message? 

One thing is clear: the louder the first person
shouts, the greater the chance that the second person
can understand the message over the background
noise (perhaps the neighbours are mowing
their lawns). Perhaps a little harder to appreciate
but equally true in the digital world, if the first
person were to speak more slowly the second person
would have a greater chance of catching every
word. The third parameter that can be adjusted
is the level of redundancy. When we speak with
a young child we tend to elaborate and use more
words to describe a concept in an attempt to increase
the chances of correct reception of the overall
message.

^1 Blindly applying the mathematical equations defining
entropy or channel capacity does not necessarily endow the
resulting quantities with any meaning or validity; information
theoretic quantities can be understood only with
respect to the assumptions and limitations of information
theory.

In information theory, these parameters are re- 
ferred to formally as the transmission power, the 
transmission rate and coding (or redundancy). 
The simplest form of coding is to repeat the mes- 
sage two or more times. This is known as a repe- 
tition code. 

Shannon's pioneering work shattered a long-held
belief that with finite power it was impossible to
transmit a message in such a way as
to guarantee its perfect reception in the presence
of noise and other interference. Indeed, even
if I shouted at the top of my voice and repeated 
myself a hundred times, every so often the inter- 
ference (lawn mowers?) will prove too great and 
my message will be lost. 

The answer lies in coding; repetition codes are 
not particularly good codes. Shannon realised that 
there exist very clever codes which can ensure that 
any two messages are so different from each other 
that the receiver can correctly decide which mes- 
sage was sent despite the interference. Technically, 
perfect reception requires the receiver to listen for- 
ever before deciding which message was sent but 
the key point is that given any positive but arbi- 
trarily small probability of error (such as one mes- 
sage being incorrectly received in 10^^ messages) 
then a code can be constructed which achieves this 
level of performance in finite time, and more im- 
portantly, the transmission power does not need 
to be increased. Increasing transmission power to 
achieve a particular error rate is grossly inefficient 
compared with choosing a better code. In the example
of one person trying to convey a message to
the other person, the secret is to share a codebook
beforehand in which a different sequence of sounds
is written for each message that may be sent.
"It will rain tomorrow" might be encoded as
(a segment of) Beethoven's 5th symphony while
"It will be sunny tomorrow" might be encoded as
a hard rock song. These two encoded messages
are "sufficiently different from each other" to have
very little chance of being confused. More importantly,
any small fragment of one message differs from
the corresponding fragment of the other. This is how
interference is overcome. Because interference itself
has limited power (otherwise the game would not be
fair!), even if there are times when the interference
is particularly bad, there will be other times when
the interference is back to normal and, in the long
run, there is no confusing Beethoven for hard rock.

For the transmission rate to be acceptable the 
codebooks would need to contain more than just 
two messages. (With two messages, each one en- 
coded by a five-minute song, the transmission rate 
would be 1 bit per 5 minutes.) In the same way 
that the transmission power does not need to go to 
infinity, the transmission rate need not go to zero. 
Precisely, Shannon discovered a quantity known as 
the channel capacity. If the transmission rate is 
less than the channel capacity then communica- 
tion with any desired level of accuracy is possible 
whereas if the transmission rate exceeds the chan- 
nel capacity then it becomes impossible to have 
arbitrarily good performance in finite time. As is 
to be expected, the channel capacity depends on 
the interference. The more destructive the inter- 
ference the lower the channel capacity. 

1.3 Intra-organism communication 

Messages are also passed around within an organ- 
ism. Information gathered by an organism's senses 
must be communicated for it to have any effect on 
the organism. Within the brain and nervous sys- 
tem, information is manipulated in at least three 
different ways: 

• information acquisition (sensory transduc- 
tion); 

• communication between spatially separated 
regions (information transmission); 

• memory formation and recall (information
storage) (Varshney et al., 2006).



In brains, each of these is essential for the emergence
of broader functions that might be termed
'computation'. Communication is perhaps most
fundamental, since information from the senses
needs to be communicated in order for it to have
any effect on an organism, while information storage,
whether in computers or brain memories, can
be viewed as communication from the past to the 
present or future. 

Understanding how the brain stores and trans- 
mits information is tantamount to understanding 
the brain as a whole because if we could "listen 
in" to brain messages it would surely be just a 
matter of time before we understood the computa- 
tional side, too. That said, the possibility that the 
brain does not separate out information transmis- 
sion from information processing must be consid- 
ered. Whereas computer architectures have mech- 
anisms known as buses for moving information be- 
tween different processing units, the brain might 
take a more efficient distributed approach and simultaneously
process and communicate in a non-separable
fashion. It has been stated that "computation
in the brain always means that information
is moved from one place to another" (Buzsaki,
2006, p. 116). A comprehensive understanding
of the brain's mechanisms for internal communi- 
cation will likely form an integral part of more 
advanced theories about how 'computation' arises 
within brain networks. 

Regardless of how the brain actually processes 
information, at the end of the day the brain is an 
input-output system (we react to what we sense) 
and therefore subject to the same laws as any 
other input-output system. Information theory is 
therefore relevant to understanding how the brain 
works, and conversely, it is highly likely that ad- 
vances in the field of information theory will be 
made in synergy with new discoveries of the computing
paradigms used by the brain (Cohen, 2004;
Sarpeshkar, 1998; Berger, 2003). Indeed, information
theory may have to expand to address new
neurobiologically relevant questions if it is to be 
powerful enough to explain all aspects of how the 
brain manipulates information. 

To demonstrate the relevance of even simply 
thinking in information theoretic terms, Landauer 
estimated that humans learn information at a rate
of about two bits per second (Landauer, 1986).^2
Taking memory loss into account, a person will accumulate
approximately two billion bits of information
in a lifetime (or approximately 240 MB in
computing terms). Since our brain has many more
synapses than two billion, Landauer concludes that 
"possibly we should not be looking for models and 
mechanisms that produce storage economies but 
rather ones in which marvels are produced by prof- 
ligate use of capacity." 

1.4 Outline of Paper 

According to Cohen (2004), biological science asks
six kinds of questions about domains ranging from 
molecules and cells, up to the biosphere: 

1. How is it built? (Structures) 

2. How does it work? (Mechanisms) 

3. What is it for? (Functions) 

4. What goes wrong? (Pathologies) 

5. How is it fixed? (Repairs) 

6. How did it begin? (Origins) 

Utilising information theory in neuroscience is ul- 
timately useful only if it can address one or more 
of these questions. 



In this paper we advocate that information the- 
ory 1) can be a useful framework for finding an- 
swers to some of these questions; but 2) must be 
broadened for its theorems to be directly applica- 
ble to neuronal networks. Although information 
manipulation can happen at very different levels 
of organisation, such as storage of information in 
genes, or communication at the level of synaptic 
transmission between cells, or at that of spiking 
patterns of neurons in a network, in this paper we 
will be focusing on examples that involve spike-based
communication between neurons.

In making these points, it is necessary for us
to introduce the most basic and well-known information
theoretic concepts in Section 2 before discussing
the challenges of applying the theory meaningfully
to questions in neuroscience in Section 3.
Then in Section 4 we summarise a specific example
which illustrates that information theoretic
approaches depend critically on different assumptions
that could be made about neural systems.
Finally in Section 5 we conclude the paper with
some closing remarks on the material we cover and
briefly summarise recent developments in information
theoretic approaches in neuroscience that extend
well beyond the classical ideas we present and
are thereby increasingly relevant to neurobiological
systems.

2 The basics and utility of Shannon Information Theory

This section briefly explains key concepts from 
Shannon information theory and hints at possible 
contributions in neuroscience. By Shannon infor- 
mation theory we are referring to a specific sub- 
part of the broader field of information theory. 
The latter, by definition, encompasses any mathematical
theorems about information and therefore
is not confined to well-known concepts introduced 
by Shannon, such as entropy and mutual informa- 
tion. As we discuss later, information theory be- 
yond Shannon theory may be very important in 
neuroscience. 



^2 To put this in perspective, a digital camera typically
stores a single photo using around 10,000,000 bits. Clearly 
then, we extract only a very small amount of information 
from what our senses receive. 



Shannon's milestone paper (Shannon, 1948) that
founded the field of information theory showed
the world that introducing the right kind of redundancy
was the key to moving information from one
place to another in an efficient and reliable manner.
Since information sources such as spoken voice
or PDF (portable document format) documents
generally contain the wrong kind of redundant information,
Shannon proposed a two-step process:
first remove the existing redundancy by compressing
the message to be sent, then introduce the right
kind of redundancy for communicating the message
through the channel at hand. These two important
concepts are known as "source coding" and
"channel coding" respectively. They motivate sev- 
eral fundamental questions including determining 
the maximum amount of compression possible of 
an information source. Answers to these questions 
are given in terms of quantities such as entropy 
and mutual information. It is important to re- 
alise that these quantities were given special names 
because they serve to answer important questions 
for a particular class of problems. It would be a 
mistake to assume without additional justification 
that they are applicable or even meaningful beyond 
the bounds of the original questions for which they
serve as the answers. See e.g. Johnson (2008)
for more discussion.



2.1 Entropy and Source Coding 

Living in the digital age, readers will be famil- 
iar with compressing files. Zipping up a file to 
send to a friend is an example of lossless compres- 
sion. Generally (but not always) the compressed 
file will be smaller than the original yet no infor- 
mation has been lost; the friend can recover the 
original file by decompressing the compressed file 
(Fig. 1). For compressing music or photos, significantly
greater compression can be achieved by using
lossy compression algorithms such as MP3 and
JPEG. As the name suggests, some information is
lost (Berger and Gibson, 1998). The original can
be recovered sufficiently well for a satisfactory compromise
to have been reached; a small amount of
quality is sacrificed for a large saving in storage
space.

Figure 1: Source Coding.

The remainder of this section discusses lossless 
compression only. Consider the problem of com- 
pressing a short message of length 8 bits. A bit 
is simply a "0" or a "1" so an 8-bit message 
is a sequence of eight zeros and ones, such as
"01011101" or "10101010".^3 A calculation (or by
writing out all the possibilities if need be, starting 
with "00000000", "00000001" and continuing un- 
til "11111111") shows that there are precisely 256 
different 8-bit messages. Compressing a message 
would mean using fewer than 8 bits to store the 
message. A simple enumeration shows that this is 
impossible as stated; there are only 128 7-bit messages,
not enough to represent all possible 8-bit
messages. How then does a computer compress a
file losslessly?

^3 Of course, any other alphabet could have been used.
An 8-character message consisting of a string of 8 letters
of the alphabet might look like "abzpuikq" and the same
reasoning would apply.

The secret is that there is often redundancy in 
the kinds of information that people are interested 
in. Equivalently, it is generally the case that not all 
messages have an equal chance of occurring. For 
argument's sake, assume that out of the 256 pos- 
sible messages, there are 15 messages which occur 
most of the time. To exploit this, we may decide to 
use 4 bits to represent each of these messages. Pre- 
cisely, "0000" would represent the first message, 
"0001" the second, up to "1110" for the 15th mes- 
sage. To represent any other message, we would 
first write down "1111" to mean "not one of the 
15" and then we would write down the original 
message using 8 bits. This means that 15 of the 
messages can be written down using only 4 bits 
but the remaining 256 - 15 = 241 messages now
require 4 + 8 = 12 bits for their storage.
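
The scheme just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say which 15 messages are the frequent ones, so the sketch assumes (hypothetically) that they are the byte values 0 to 14, and the names `FREQUENT`, `ESCAPE`, `encode` and `decode` are invented for the example.

```python
# Hypothetical choice: the 15 frequent messages are the byte values 0..14.
FREQUENT = [format(i, "08b") for i in range(15)]
ESCAPE = "1111"  # 4-bit prefix meaning "not one of the 15"


def encode(message: str) -> str:
    """Encode one 8-bit message as either 4 bits or 4 + 8 = 12 bits."""
    if message in FREQUENT:
        return format(FREQUENT.index(message), "04b")
    return ESCAPE + message


def decode(bits: str) -> list[str]:
    """Recover the 8-bit messages from a concatenated string of code words."""
    messages = []
    pos = 0
    while pos < len(bits):
        prefix = bits[pos:pos + 4]
        pos += 4
        if prefix == ESCAPE:
            # Escape prefix: the next 8 bits are the message itself.
            messages.append(bits[pos:pos + 8])
            pos += 8
        else:
            # Any other prefix is the index of a frequent message.
            messages.append(FREQUENT[int(prefix, 2)])
    return messages


original = ["00000000", "00001110", "11111111", "00000001"]
compressed = "".join(encode(m) for m in original)
assert decode(compressed) == original  # lossless: nothing is lost
print(len(compressed))  # 4 + 4 + 12 + 4 = 24 bits instead of 32
```

Because "1111" is never used as the index of a frequent message, the decoder can always tell where one code word ends and the next begins, which is what makes the concatenated bit string decodable.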

The only way to make this meaningful is to 
consider repeating this compression exercise many 
times. If we had to store a very large number N of
8-bit messages using this scheme, how many bits
would be required? Assume that K out of the N
messages belong to the set of 15 special messages.
These K messages require 4 bits each while the remaining
N - K messages require 12 bits each, so in total
4K + 12(N - K) bits are required compared with
8N bits had we not compressed the messages. Provided
K is sufficiently large, we will have succeeded
in compressing the data. For example, if K = 750
and N = 1000 then we would require only 6,000
bits rather than the original 8,000 bits.
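
As a quick check of this arithmetic, the bit count can be written as a small function (the name `total_bits` is ours, introduced only for illustration). Rearranging 4K + 12(N - K) < 8N also shows what "sufficiently large" means here: the scheme saves space exactly when K > N/2.

```python
def total_bits(num_frequent: int, num_messages: int) -> int:
    """Bits used by the 4-bit / 12-bit scheme: 4K + 12(N - K)."""
    return 4 * num_frequent + 12 * (num_messages - num_frequent)


N = 1000
K = 750
print(total_bits(K, N))       # 6000
print(8 * N)                  # 8000 bits without compression
print(total_bits(N // 2, N))  # 8000: the break-even point K = N / 2
```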

The conventional way to describe the above sce- 
nario is to work with probabilities. We assume 
that the messages we are being asked to compress 
are being generated at random and there is no cor- 
relation between the message we are being asked 
to compress now and the messages we have al- 
ready compressed. Mathematically, we represent 
the original sequence of messages by a sequence 
of independent and identically distributed random 
variables $\{x_1, x_2, \ldots\}$, each having a probability
distribution p(X). In the above example, each $x_k$ would
be an 8-bit message (or equivalently, a number between
0 and 255 inclusive) and p(X) would be the
probability that a particular message X is chosen.
For concreteness, assume that each of the
first fifteen messages has a 5% chance of occurrence
(meaning there is a 75% chance of a randomly
chosen message being one of these 15 and
thereby corresponding with the earlier choice of
K = 750 and N = 1000). Then p(00000000) =
p(00000001) = ... = p(00001110) = 0.05. We
assume that all other messages each have a probability
of 0.25/(256 - 15) of occurring. The expected
number of bits required to compress a single message
can then be calculated by summing over i the
probability that the i-th message occurs multiplied
by the number of bits required to represent the i-th
message. When most of the probabilities are the
same the calculation simplifies. With the values
given above, the expected length is calculated to be

$$15 \times 0.05 \times 4 + (256 - 15) \times \frac{0.25}{256 - 15} \times 12 = 6.$$

Thus, on average, 6N bits would be required to compress
N messages drawn at random if the above scheme
were used.
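
The expected length can be reproduced with exact rational arithmetic. The sketch below groups messages that share the same probability and code length; the variable names are illustrative and not part of the original presentation.

```python
from fractions import Fraction

# (probability of each message, bits used per message, number of such messages)
groups = [
    (Fraction(5, 100), 4, 15),            # the 15 frequent messages
    (Fraction(25, 100) / 241, 12, 241),   # the remaining 256 - 15 messages
]

total_probability = sum(p * count for p, _, count in groups)
expected_length = sum(p * bits * count for p, bits, count in groups)

print(total_probability)  # 1, so the probabilities form a valid distribution
print(expected_length)    # 6 bits per message on average
```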

Is there a better compression scheme, one which 
requires fewer than 6 bits per message on average? 
In fact, what is the best possible? As elucidated 
presently, Shannon was able to answer these questions.
First though, a technicality needs mentioning.

Coding each message separately, as was done 
above, is inefficient. It is better to concatenate a 
series of messages and compress them all at once; 
this provides more opportunity for better compres- 
sion through the simple fact that there are more 
compression schemes to choose from. (It also alle- 
viates the wasted space caused by otherwise having 
to use an integer number of bits to represent each 
message.) 

It is therefore quite standard to refer to each 
$x_k$ as a symbol rather than a message and ask
how many bits per symbol on average must be
used to compress the infinitely long sequence of
independent and identically distributed symbols
$\{x_1, x_2, \ldots\}$ if each symbol has a probability p(X)
of occurring.^4

When X is a random variable and its distribution
is p(X), its entropy is defined as

$$H(X) = -E_{p(X)}[\log p(X)], \tag{1}$$

where $E_{p(X)}[\cdot]$ denotes the expectation with respect
to p(X). The practical operation is a summation
when X is discrete and an integration when
X is continuous. When 2 is the base of the logarithm,
i.e. $\log_2$, the units of entropy are bits and H(X)
is precisely the number of bits per symbol required
on average to compress an infinitely long sequence
of symbols when each symbol has probability p(X)
of occurring.
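
For a discrete random variable, the expectation in (1) is a finite sum, which the following sketch evaluates in bits. The function name `entropy_bits` is hypothetical, and outcomes of probability zero are skipped, which corresponds to the usual convention that 0 log 0 is taken to be 0.

```python
import math


def entropy_bits(probabilities):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    # Terms with p == 0 are skipped (convention: 0 * log2(0) = 0).
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


print(entropy_bits([1.0]))       # a single certain outcome: 0 bits
print(entropy_bits([0.5, 0.5]))  # two equally likely outcomes: 1 bit
print(entropy_bits([0.1, 0.9]))  # a biased binary outcome: about 0.47 bits
```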

It is for this reason that people endeavour to 
explain entropy as quantifying the "ambiguity" 
or "uncertainty" about the random variable X. 
When X has only one possible state (that must 
therefore occur with probability 1), there is no am- 
biguity about X and the entropy is 0. However if 
X takes one of two states with probability p and
1 - p, respectively (0 < p < 1), entropy is maximised
when p = 0.5 and H(X) = $\log_2 2$. This is
exactly 1 bit and implies that a sequence of equally
likely zeros and ones cannot be compressed. Note
also that if p = 0, $p \log_2 p = 0$.

^4 It is traditional in probability theory to use an upper
case letter to represent the random variable while the corresponding
lower case letter represents realised values.

Shannon's source coding theorem states that (in 
the limit as the number of symbols goes to infin- 
ity) it is possible to compress each symbol to H(X) 
bits on average (and impossible to do better). It 
does not however say how to design such a source 
code. Furthermore, the practical construction of 
compression and decompression methods is com- 
plicated by considerations of algorithmic efficiency 
(which affects battery life in portable equipment 
such as mobile telephones) and latency (how long 
the receiver must wait from the time a symbol is 
sent until that symbol can be received and de- 
coded). That said, having a target to aim for is 
extremely useful and entropy provides that target 
for source compression. 

The reader may wish to verify that for the ex- 
ample introduced in this section, the corresponding 
entropy is 5.72 bits per symbol. This represents the 
best any compression scheme can hope to achieve, 
and indeed, it is lower than the 6 bits per symbol 
scheme presented here. 
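
Readers who prefer to check the figure numerically can do so with a few lines of Python. The sketch builds the distribution described above (15 messages with probability 0.05 each, and the remaining 241 messages sharing probability 0.25 equally) and sums the terms of the entropy definition.

```python
import math

# 15 frequent messages at 5% each, 241 others sharing the remaining 25%.
probabilities = [0.05] * 15 + [0.25 / 241] * 241
assert abs(sum(probabilities) - 1.0) < 1e-12

entropy = sum(-p * math.log2(p) for p in probabilities)
print(round(entropy, 2))  # 5.72 bits per symbol
```

The gap between 5.72 and 6 bits per symbol is the room left for a cleverer code than the simple 4-bit / 12-bit scheme.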

2.2 Mutual Information, Channel Capacity and Channel Coding

The following example of a binary symmetric channel
will be used to add concreteness to the ensuing
introduction of mutual information and channel
capacity. Let $\{s_1, s_2, \ldots\}$ denote a binary sequence
which is to be transmitted to another person
or device. It is called the source sequence. The
medium through which a message can be sent from
one person or device to another is called the channel.
Mathematically, a channel takes a sequence
at its input and it generates another sequence at
its output. If the channel were ideal, it would simply
copy its input to its output and communication
would be straightforward. Generally though,
the channel is not ideal. It introduces random errors.
If $\{x_i\}$ is the binary input sequence (which is
shorthand notation for $\{x_1, x_2, \ldots\}$) then the binary
output sequence $\{\hat{x}_i\}$ of a binary symmetric
channel with error probability p is given by the
rules that 1) for each integer i, the output $\hat{X}_i$ at
time i depends only on the corresponding input $X_i$
at the same time i; and 2) the probability that $\hat{X}_i$
differs from $X_i$ is p. If p = 0.1 then on average one
in every ten symbols will be corrupted, meaning
either a 0 was sent and a 1 was received, or a 1
was sent and a 0 was received.
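
A binary symmetric channel is simple to simulate, which can help build intuition for the two rules above. The following is a minimal sketch, not a model of any particular physical channel; the function name and the fixed random seed are our own choices for the illustration.

```python
import random


def binary_symmetric_channel(bits, p, rng):
    """Flip each input bit independently with probability p."""
    return [b ^ 1 if rng.random() < p else b for b in bits]


rng = random.Random(0)  # fixed seed so that the run is repeatable
sent = [rng.randint(0, 1) for _ in range(100_000)]
received = binary_symmetric_channel(sent, 0.1, rng)

error_fraction = sum(s != r for s, r in zip(sent, received)) / len(sent)
print(error_fraction)  # close to 0.1: about one symbol in ten is corrupted
```

Each output bit depends only on the input bit at the same position, and a 0 is as likely to be flipped as a 1, which is the sense in which the channel is "symmetric".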

What sequence $\{x_i\}$ should be sent over the
channel if the ultimate aim is to send $\{s_i\}$ reliably
to the receiver, assuming of course that the receiver
can process the output $\{\hat{x}_i\}$ of the channel before
deciding what it believes the message $\{s_i\}$ is? This
is illustrated in Fig. 2 where the operation of generating
$\{x_i\}$ from $\{s_i\}$ is called (channel) encoding
and the operation of generating $\{\hat{s}_i\}$, the receiver's
best guess at the original message, is called (channel)
decoding.

Figure 2: Channel Coding.

For simplicity, often the encoding and decod- 
ing processes work on blocks of data. Precisely, 
the original source sequence S is divided up into 
subsequences of length K. Each of these is encoded
to a longer binary sequence X with length
N. For example, a simple K = 2, N = 3 block
code would be to add a parity bit (i.e. a bit that
is zero when the sequence has an even number of
zeros, and a one when an odd number) after every
two symbols, so: "00" becomes "000"; "01"
becomes "011"; "10" becomes "101" and "11" becomes
"110." Therefore, the sequence "0111" becomes
"011110" where the 3rd and 6th bits are the
introduced parity bits.
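
The parity code of this example can be written out as a short encoder. This is a sketch with a hypothetical function name; it counts ones rather than zeros, which gives the same parity bit for blocks of two symbols.

```python
def add_parity(source: str) -> str:
    """Append one parity bit after every two source bits (K = 2, N = 3)."""
    if len(source) % 2 != 0:
        raise ValueError("source length must be a multiple of K = 2")
    blocks = []
    for i in range(0, len(source), 2):
        block = source[i:i + 2]
        # For a 2-bit block, an odd number of zeros is the same as an odd number of ones.
        parity = block.count("1") % 2
        blocks.append(block + str(parity))
    return "".join(blocks)


print(add_parity("0111"))  # 011110
```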

This coded sequence is transmitted through the 
channel. At the other end, the receiver reverses 
the process, converting each block of N symbols 
back into a block of K symbols. In this particular 
case, introducing just a single parity bit does not 
allow the receiver to have a better guess at what 
the original message is, but it does allow the re- 
ceiver to detect if a single bit has been changed. 
This is called error detection. Error correction, 
when the receiver is not only able to detect an er- 
ror has occurred but can fix the error and therefore 
recover the original message, requires more redun- 
dancy to be introduced, that is, choosing N to be 
larger than K + 1. (If there are too many errors 
then error correction would fail, but the key point 
is that the probability that several consecutive bits 
are wrong is significantly smaller than if a single bit 
were wrong, therefore a small increase in redun- 
dancy allows a substantial increase in reliability.) 
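As a rough illustration of how extra redundancy buys reliability, the sketch below uses a repetition code with majority-vote decoding. This code is not discussed in the text; it is assumed here only because its error probability is easy to write down. It also assumes that each transmitted bit is flipped independently with probability p, as in a binary symmetric channel. The name `repetition_error_probability` is hypothetical.

```python
import math


def repetition_error_probability(p, n):
    """Probability that majority voting over n copies of one bit decodes wrongly.

    Assumes n is odd and that each copy is flipped independently with
    probability p. Decoding fails when more than half the copies are flipped.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd so that a majority always exists")
    threshold = n // 2 + 1
    return sum(
        math.comb(n, j) * p**j * (1 - p) ** (n - j)
        for j in range(threshold, n + 1)
    )


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(n, repetition_error_probability(0.1, n))
```

With p = 0.1, sending each bit three times (K = 1, N = 3) reduces the per-bit error probability from 0.1 to about 0.028, at the cost of lowering the rate to 1/3.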

The two lengths K and N together define the 
"rate" of the code, which perhaps is better un- 
derstood as measuring the decrease in throughput 
caused by the introduction of redundancy by the 
encoder. Precisely, in the above example, the rate 
of the code is R = K/N, meaning that if the channel
can accept encoded symbols at a rate of 1 bit
per second then the source symbols must have a 
rate of only R bits per second. 

Reducing the rate enables more redundancy to 
be introduced which can be used to increase the 
chance of the receiver being able to work out what 
message was sent. Shannon's remarkable observation
was that there is a much better way of increasing
the chance of correct reception than by
decreasing the rate towards zero. For a fixed rate
R, the block size K can be increased (thereby increasing
N according to the formula N = K/R).
This allows a more sophisticated form of redun- 
dancy to be introduced (but at the price of intro- 
ducing greater latency; the receiver must receive 
N symbols before it can work out what the corre- 
sponding K message symbols were). 

Shannon proved that there exists a rate C, called
the channel capacity, such that for any rate R
strictly less than C and any desired error rate $\epsilon > 0$
(meaning that the probability that the receiver decodes
a bit incorrectly is less than $\epsilon$, which in
practice is chosen to be very small),
there exists a K (possibly quite large) and a block
encoder and decoder pair such that the receiver
can correctly decode each bit of the source message
with error probability less than $\epsilon$. This is
customarily summarised by saying that error-free
communication is possible at rates below the channel
capacity.

Shannon was able to give a formula for computing
the channel capacity C. When this formula
(described below) is applied to the above example
of a binary symmetric channel with probability
of error p, the channel capacity is found to be
$C = 1 + p \log_2 p + (1 - p) \log_2 (1 - p)$, meaning for
example that if the channel can transfer one bit per
second then the source symbols must arrive slower
than C bits per second. If p = 0.1 then $C \approx 0.531$,
meaning that for every 1,000 source symbols, just
over 1,883 encoded symbols are required for reliable
communications.
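The numbers quoted for the binary symmetric channel can be checked with a short calculation. The sketch below simply evaluates the capacity formula stated above; the function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a binary variable that equals 1 with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bsc_capacity(p):
    """Capacity in bits per channel use of a binary symmetric channel with error probability p."""
    return 1.0 - binary_entropy(p)


if __name__ == "__main__":
    capacity = bsc_capacity(0.1)
    print(capacity)            # roughly 0.531 bits per channel use
    print(1000 / capacity)     # encoded symbols needed per 1,000 source symbols
```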

The formula for channel capacity involves a 
quantity called mutual information. Intuitively, 
the mutual information of the input and the output 
of the channel measures how much information the 
output provides about the input; the more reliable 
the channel the higher the mutual information. It 
is therefore reasonable to expect that the larger the 
mutual information the greater the channel capac- 
ity. 

Bearing in mind that "information" is a very 
general word and it is therefore not possible to cap- 
ture all its nuances in a single mathematical defi- 
nition, it is expedient to return to the idea in the 
previous section of using asymptotic compressibil- 
ity as a measure of information. It turns out that 
this is the right definition to use when it comes 
to determining the capacity of a channel (which in 
itself is an asymptotic measure).

^Shannon also proved the converse, that no scheme (even
non-block-based coding schemes) can achieve arbitrarily
small error rates if its rate is greater than or equal to
C.

Suppose there are two random variables, X and
Y, and they are somehow related to each other.
For example, X might denote temperature while
Y denotes humidity. Even simpler, X might represent
the outcome of rolling a 6-sided die while
Y is given the value 0 if the die landed on an even
number, or 1 if odd. Knowing Y gives partial information
about X; how can we measure how much
information Y tells us about X?

The fact that Y gives partial information about
X is reflected in the fact that if Y is known then X
can be compressed more than if Y were not known.
In the above example, if Y were not known then
it is impossible to compress X because each of the
outcomes is equally likely; we are forced to use one
of six possible symbols (or $\log_2 6$ bits) to store each
sample of X; the entropy of X is $H(X) = \log_2 6$.
If Y is known though then only one of three possible
symbols (or $\log_2 3$ bits) needs to be stored; the
average conditional entropy is $H(X|Y) = \log_2 3$.
The additional amount of compression possible,
$I(X;Y) = H(X) - H(X|Y)$, is called the mutual
information and measures the amount of information
Y provides about X. It turns out that mutual
information is symmetric, $I(X;Y) = I(Y;X)$,
hence there is no need to specify the order of X
and Y. In the above example, if X is known then
Y is known, therefore no additional bits are required
to store Y if X is known: $H(Y|X) = 0$.
Since $H(Y) = \log_2 2 = 1$, it is indeed the case that
$I(Y;X) = H(Y) - H(Y|X) = \log_2 2 = \log_2 6 - \log_2 3 = I(X;Y)$.
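The die example can be verified numerically. The following sketch builds the joint distribution of X (the die outcome) and Y (0 for even, 1 for odd) and computes the entropies in bits. The helper names are hypothetical, and the conditional entropy is obtained from the chain rule H(A|B) = H(A,B) - H(B).

```python
import math
from collections import defaultdict


def entropy(dist):
    """Entropy in bits of a distribution given as a dict {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def marginal(joint, axis):
    """Marginal of a joint distribution {(x, y): p} over the chosen axis (0 for X, 1 for Y)."""
    result = defaultdict(float)
    for pair, p in joint.items():
        result[pair[axis]] += p
    return dict(result)


def conditional_entropy(joint, given_axis):
    """H(A|B) = H(A, B) - H(B), where B is the variable on given_axis."""
    return entropy(joint) - entropy(marginal(joint, given_axis))


# X is a fair 6-sided die; Y is 0 when X is even and 1 when X is odd.
joint = {(x, x % 2): 1 / 6 for x in range(1, 7)}

h_x = entropy(marginal(joint, 0))           # log2(6)
h_x_given_y = conditional_entropy(joint, 1)  # log2(3)
h_y = entropy(marginal(joint, 1))           # log2(2) = 1
h_y_given_x = conditional_entropy(joint, 0)  # 0

if __name__ == "__main__":
    print(h_x - h_x_given_y)   # I(X;Y)
    print(h_y - h_y_given_x)   # I(Y;X), the same value
```

Both differences come out as one bit, which is the statement that knowing whether the die is even or odd saves exactly one bit when storing X.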

Returning to the channel capacity calculation,
assume that a sequence generated by X is sent
through the channel. The output sequence is itself
generated by a random variable, call it Y. If
the receiver wants to recover X, it needs at least
an extra $H(X|Y)$ bits of information (for otherwise
there would be an even more efficient scheme for
compressing X than the best possible, a contradiction).
Looking at it from another angle though,
this implies that $I(X;Y) = H(X) - H(X|Y)$ bits of
information have somehow been transmitted successfully
with each use of the channel (since with
an extra $H(X|Y)$ bits of carefully chosen information
it is theoretically possible to recover X). For
the case of the binary symmetric channel with error
probability p, a reasonably straightforward calculation
shows that if the input X takes the value 1
with probability q and the value 0 with probability
$1 - q$ then the mutual information of the input X
and the output Y is $I(X;Y) = H(q') - H(p)$, where
$q' = q(1 - p) + (1 - q)p$ is the probability that the
output Y takes the value 1 and
$H(\theta) = -\theta \log_2 \theta - (1 - \theta) \log_2 (1 - \theta)$ is the number
of bits required to compress a binary sequence
taking the value 1 with probability $\theta$ and the value
0 with probability $1 - \theta$.

^The reason for this symmetry is that if we were to compress
X first then compress Y, or if we were to compress
Y first then compress X, we end up either way with having
compressed optimally the joint sequence generated by
X and Y. Mathematically $H(X,Y) = H(X) + H(Y|X) =
H(Y) + H(X|Y)$, from which it follows immediately that
$I(X;Y) = I(Y;X)$.

Although we must have the channel input X represent
the source message S in some way, there
is otherwise arbitrary freedom in how to choose
X. Why not choose X to maximise the mutual
information? The first entropy term in the mutual
information can be at most 1 (which occurs when
q = 1/2). Therefore, the
largest number of bits we can ever expect to transmit
reliably through the binary symmetric channel
is $1 - H(p)$ bits per usage of channel. If p = 0.1
then $1 - H(p) \approx 0.531$. In this case, each transmitted
bit can carry at most 0.531 bits of the source message
(since the channel transmits 1 bit
per usage), or in other words, we must have the
rate of the code (see above) satisfy K/N < 0.531.
Remarkably, Shannon proved that this bound is
achievable; whenever the rate is less than the maximum
of the mutual information, (as close as you
like to) error-free communication is possible.
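The maximisation over the input distribution can be illustrated with a simple grid search. The sketch below computes I(X;Y) = H(Y) - H(Y|X) for the binary symmetric channel directly from the output distribution, where the output equals 1 with probability q(1 - p) + (1 - q)p, and then searches over q. The function names and the grid resolution are assumptions made for illustration only.

```python
import math


def binary_entropy(t):
    """Entropy in bits of a binary variable that equals 1 with probability t."""
    if t <= 0.0 or t >= 1.0:
        return 0.0
    return -t * math.log2(t) - (1 - t) * math.log2(1 - t)


def bsc_mutual_information(q, p):
    """I(X;Y) in bits for a binary symmetric channel.

    q is P(X = 1) and p is the probability that the channel flips a bit.
    H(Y|X) equals H(p) whatever the input, so I(X;Y) = H(Y) - H(p).
    """
    prob_y_is_one = q * (1 - p) + (1 - q) * p
    return binary_entropy(prob_y_is_one) - binary_entropy(p)


def best_input_probability(p, grid_points=1001):
    """Grid search for the q that maximises the mutual information."""
    candidates = (i / (grid_points - 1) for i in range(grid_points))
    return max(candidates, key=lambda q: bsc_mutual_information(q, p))


if __name__ == "__main__":
    q_star = best_input_probability(0.1)
    print(q_star, bsc_mutual_information(q_star, 0.1))
```

For p = 0.1 the search settles on q = 1/2, where the mutual information equals 1 - H(p), approximately 0.531 bits per channel use.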

It is of interest to note that while here we have
considered an example where X is a discrete random
variable, the best known channel for which the
capacity achieving input distribution is known
is the additive Gaussian noise channel
with a power constraint on the input. In this
case, the capacity achieving input distribution is in
fact continuous, namely a Gaussian distribution. As
we discuss later though, it is far more common for
the capacity achieving input to be discrete.

^Note that Shannon proved "it is achievable" in the limit
when N and K go to infinity, but did not show "how to
achieve it." In order to be close to the bound, we generally
need a good error correction code and N and K must be
very large.

We now summarise and precisely define the important
information theoretic terms that we have
introduced and discussed above without stating
their formal definitions. Each of these is defined
mathematically as follows. We already introduced
entropy, in Eqn. (1). The average conditional entropy
requires a double expectation:

$$H(Y|X) = -E_{p(X)}\left[E_{p(Y|x)}\left[\log p(y|x)\right]\right]. \qquad (2)$$

As an aid to intuition, consider a single outcome of
the random variable X. The entropy of Y given X
can be calculated from Eqn. (1) by calculating the
expectation with respect to the conditional distribution
of Y given X = x. If this is carried out for
all possible outcomes of X, the result is a function
of X. This function can then be averaged with respect
to the distribution of X, and by definition, the
result is the average conditional entropy, $H(Y|X)$.
As mentioned above, the mutual information can
be expressed as $I(X;Y) = H(X) - H(X|Y) =
H(Y) - H(Y|X)$. In what follows below we write
mutual information in a different form based on
another entity called relative entropy or Kullback-Leibler
divergence. This is defined as



$$D(p(X)\|q(X)) = E_{p(X)}\left[\log \frac{p(x)}{q(x)}\right],$$



where $p(X)$ and $q(X)$ are two distributions of the
same random variable X. Note that the relative
entropy is non-negative and is equal to 0 if $p(X)$ is
identical to $q(X)$. Mutual information is defined
as the relative entropy between the joint distribution
of X and Y, and the product of the marginal
distributions of X and Y:



$$I(X;Y) = D(p(X,Y)\|p(X)p(Y)) = E_{p(X,Y)}\left[\log \frac{p(x,y)}{p(x)p(y)}\right]. \qquad (3)$$



It is straightforward using $p(x,y) = p(y|x)p(x) =
p(x|y)p(y)$ to obtain the above stated relationships
between mutual information and entropy. The definitions
as written here hold for both discrete and
continuous distributions of X and Y. In this section
we have considered only a simple discrete case,
where X and Y are both binary. In general they
can have any number of states, or be continuous,
as is the case below. In full generality, the channel
capacity is defined as

$$C = \sup_{p(X)} I(X;Y).$$
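For discrete variables, the relative entropy form of mutual information in Eqn. (3) translates directly into code. The sketch below is a minimal illustration under two assumptions that are not fixed by the text: logarithms are taken to base 2, so results are in bits, and the joint distribution is supplied as a list of rows indexed by x with columns indexed by y. The function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Relative entropy D(p || q) in bits for two discrete distributions.

    p and q are sequences of probabilities over the same outcomes. Terms
    with p_i = 0 contribute nothing; q_i must be positive wherever p_i is.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_information(joint):
    """I(X;Y) as the relative entropy between p(x, y) and p(x) p(y)."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(column) for column in zip(*joint)]
    flat_joint = [p_xy for row in joint for p_xy in row]
    flat_product = [a * b for a in p_x for b in p_y]
    return kl_divergence(flat_joint, flat_product)


if __name__ == "__main__":
    # Binary symmetric channel with error probability 0.1 and a uniform input.
    bsc_joint = [[0.45, 0.05],
                 [0.05, 0.45]]
    print(mutual_information(bsc_joint))
    # Independent variables share no information.
    independent = [[0.2 * 0.7, 0.2 * 0.3],
                   [0.8 * 0.7, 0.8 * 0.3]]
    print(mutual_information(independent))
```

For the binary symmetric channel with a uniform input this reproduces the value 1 - H(p) obtained from the entropy form, which is a useful consistency check between the two definitions.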

3 Challenges of Utilising Information Theory in Neuroscience

In this section, some of the challenges of integrating 
information theory into neuroscience are touched 
upon. In particular, we must make assumptions 
about the way in which information is represented 
in the brain, whereas in engineering this is specified 
by the designer. Ultimately it will be necessary to 
extend the frontiers of information theory if it is to 
encompass in its entirety the information process- 
ing techniques of neuronal networks. Such an ex- 
pansion would involve in part the greater integra- 
tion into information theory of systems and control 
theory from engineering and the theory of com- 
putation from mathematics. Whereas engineers 
aim to keep separate communication circuitry from 
computation circuitry so as to simplify the design 
and analysis of engineered systems, there is no rea- 
son why nature should maintain such a separation. 
Evolution tends to find efficient designs and not 
necessarily "simple" designs. 

It would be counter-productive though to assume
that information theory in its current form
could not be applied usefully in computational neuroscience.
One place it is immediately applicable
is the early sensory pathways, where information
flows primarily in one direction. Considerably
more care must be taken when feedback loops are
present. The early sensory pathways are especially
suitable because (in experiments) we have control
over the input signal itself and hence can investigate
how a known signal is communicated from one
neuron to another. The complication though is that
it appears the information is being processed at the
same time it is being communicated.

The brain heavily compresses the information it 
receives from its sensory systems. Since the en- 
tropy of a signal determines precisely how much 
(lossless) compression is possible, it sets fundamen- 
tal limits which must be respected by any system, 
including biological systems. It is no surprise then 
that the estimation of entropy of neural signals 
based on experimental data is an active research 
area (Borst and Theunissen, 1999; Panzeri et al.,
2007; Vu et al., 2009).



In the brain, neurons communicate with each
other and transfer information. The primary
means of communication are the spikes of each
neuron (Rieke et al., 1997), and it is parsimonious
to model their occurrences as depending
randomly on the neuron's input (Poggio, 1964;
Mainen and Sejnowski, 1995). Thus, neurons communicate
through a noisy channel, and mutual information
should therefore play an important role
in understanding the nervous system and brain.

In order to consider a neuron as a communica- 
tion channel, we need to consider what we mean by 
"communication" in the specific context of biologi- 
cal neurons. There are several important concepts 
to consider before we can begin to discuss a specific 
example of the application of information theory in 
neuroscience. 



3.1 Communication Channels and Modulation

A definition of communication requires the existence
of a physical medium that allows propagation
of energy from one place (an "energy source") to
another place where that energy has some causal
effect (an "energy sink"). We also need to define a
means by which some property of the source can be
altered in a way that results in an observable difference
at the sink after propagation through the
channel. In communications engineering theory,
the energy propagation is called "transmission,"
the source is known as a "transmitter" and the
sink as a "receiver." These concepts are not sufficient
for communication. There also needs to be
an "information source" that is initially observable
at the transmitter's location, but not at the receiver's.
Communication requires the transmitter
to alter the energy source in a manner that reflects
the information source, and that can subsequently
be observed at the receiver after propagation.
This conversion from information source to
energy source is known as "modulation."

^Communication can also take place from the past to
the future, in a fixed location, such as when writing to a
memory device then reading it back again at a later date,
but here we focus on place-to-place communication.

A familiar example where each of these concepts 
is readily identified is analog AM or FM radio 
transmission, in which recorded sound signals are 
communicated, and then reproduced via a speaker. 
In this example, the transmission medium can be 
a vacuum or air, the propagating energy source
is electromagnetic radiation, and the transmitter 
modulates the electromagnetic waves in a manner 
that reflects the recorded sound signal. AM is am- 
plitude modulation, and means that a single fre- 
quency sinusoidal wave of E-M (electromagnetic) 
radiation has its amplitude changed over time. FM 
is frequency modulation, which means the ampli- 
tude remains constant, while the carrier frequency 
is changed over time. 

3.2 Neuronal Spikes and Spike Interval Coding

Modulation of the energy source can be thought 
of as a code, since it requires a conversion from 
one kind of information representation to another. 
Indeed, in neuroscience, modulation has a more 
general meaning than in communications engineer- 
ing, and the conversion from an information source 
to variations in a parameter of the energy source 
is instead known as a "code." This is largely in 
contrast with communications engineering, where 
"code" instead refers to conversion between differ- 
ent representations of the information source prior 
to transmission at the source, for example "source 
coding" and "error correction coding." 

If we wish to consider communication between 
neurons, we need to identify the transmission 
medium, the form of energy propagation, and a 
modulation mechanism. From now on we will use 
neuroscience terminology, and refer to modulation 
as the "code." Further, we will refer to the infor- 
mation source as the "input," and the observable 
effect at the receiver that results from the input as 
the "output." 

Although over longer time scales the plasticity of 
neurons can encode/carry information, in shorter 
time scales the primary physical medium for com- 
munication seems to be the axons of neurons, and 
the energy propagation is a pulse-like wave of volt- 
age that travels along an axon where it may be 
received by other neurons at synaptic junctions. 
These pulses are known as action potentials, or
spikes. Typical cortical neurons transmit spikes to
many other neurons, and receive spikes from many 
neurons. 

While there are a number of different "communi- 
cation channels" in neuronal circuitry — including 
segments of the dendritic tree which carry post- 
synaptic potentials towards the soma of the cell — 
we choose to focus on action potentials because they
are one of the most important communication mechanisms
between two neurons.

The other concept we must also attempt to iden- 
tify is the way in which spikes are coded (modu- 
lated) in order to communicate information. Two 
possibilities are the height and the width of each 
spike. However, these are observed to be close to 
identical in most cases, and do not seem to be in- 
formation carrying parameters. Instead, it is the 
interval between spikes (ISI: inter-spike interval) 
that is thought to play an important role in carry- 
ing information through a neuronal channel. 

Given this, how do ISIs represent information?
In neuroscience there are mainly two different
ideas. One idea is that the ISI itself (see for example
MacKay and McCulloch (1952)) carries information.
This is called "temporal coding" (Fig. 3a).
The other is that the number of spikes in a
fixed time interval (see Stein (1967); Lansky et al.
(2004)) carries information. This is called "rate
coding" (Fig. 3b).
^Vacuum is thought of as a transmission medium for
electromagnetic waves.



(a) Temporal coding. (b) Rate coding. 

Figure 3: Two forms of spike codes. 
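The difference between the two readouts in Figure 3 can be made concrete with a toy spike train. In the sketch below, the spike times, the 50 ms counting window and the function names are all hypothetical values chosen for illustration; times are given in integer milliseconds to keep the arithmetic exact.

```python
def inter_spike_intervals(spike_times):
    """Temporal-coding readout: the intervals between consecutive spikes."""
    return [later - earlier for earlier, later in zip(spike_times, spike_times[1:])]


def spike_counts(spike_times, window, duration):
    """Rate-coding readout: the number of spikes in each window of fixed length.

    Only whole windows that fit inside the recording duration are counted.
    """
    n_windows = int(duration // window)
    counts = [0] * n_windows
    for t in spike_times:
        index = int(t // window)
        if 0 <= index < n_windows:
            counts[index] += 1
    return counts


if __name__ == "__main__":
    spikes_ms = [10, 30, 35, 80, 120, 125, 190]  # hypothetical spike times
    print(inter_spike_intervals(spikes_ms))
    print(spike_counts(spikes_ms, window=50, duration=200))
```

The same spike train therefore yields two different output variables, a sequence of intervals or a sequence of counts, and any mutual information calculation has to commit to one of them.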

So far we have only stated that an input can be
communicated to a receiver output. If this is a
perfectly repeatable process, the rate at which information
can be transmitted depends on the rate
at which the input is updated, and (in line with
Section 2) also depends on the probability distribution
of the input, via its entropy. The transmission
is usually not perfect, and noise is introduced.
This fact leads us to consider the information theoretic
concepts of mutual information and channel
capacity.

Information theory is not concerned with the 
type of modulation. It requires an abstraction that 
specifies only what the observable output variable 
should be. Since we are not designing a system, we 
must make some guesses about aspects of the input 
and output for a neuronal communication channel, 
and then proceed to calculations of mutual infor- 
mation. 

Therefore, in Section 4, where we consider the
channel capacity of a neuron model, we necessarily
begin by specifically defining the input and output
of the channel, and state a model for the channel
noise.

4 Example: Channel Capacity of a Neuron

In this section we present some of our results on the 
channel capacity for simple neuron models. One 
reason for providing this example is to illustrate
that there is no simple single formula for channel 
capacity, and hence assumptions about the under- 
lying model are very important. If these assump- 
tions change, the channel capacity also changes. 

As we have seen, channel capacity is the max- 
imum amount of information that can be trans- 
ferred through a noisy channel in a unit time. It 
may be much larger than the actual information 
transmission rate. This brings us to a natural ques- 
tion, that is, why do we need to know the capacity? 

Channel capacity is similar to the
maximum speed indicated on the speedometer of
an automobile. While you will likely never drive
at that speed, the maximum speed is useful because
it tells you the potential of the automobile,
even though you drive at moderate speed. Channel
capacity provides not only the upper limit of
the possible information transmission rate, but also
describes how good the channel is.

Although there is much interest in the quantity
in neurophysiology (Borst and Theunissen, 1999),
theoretical work is rare (MacKay and McCulloch,
1952; Stein, 1967; Johnson, 2010;
Suksompong and Berger, 2010). We have obtained
some interesting results on the capacity from two
different viewpoints. The details will be given below.



4.1 Inputs and Noise of Channel 

We consider here a single spiking neuron, and as- 
sume that the input to the neuron controls the 
expectation of the neuron's output ISIs. Using 
the terminology introduced above, the information 
source modulates the ISI. We introduce channel 
noise to the picture by assuming that the ISI is a 
gamma-distributed random variable, when the in- 
put to the neuron remains constant. 

Biologically, each cortical neuron receives inputs
from many (pre-synaptic) neurons and each sensory
neuron receives physical stimuli. The above
assumption is to model all the inputs to the neuron
as a single parameter $\theta$. Although this assumption
may seem too simple, $\theta$ is a time varying function
and is able to represent a lot of possible functions.
In the gamma ISI model, the expectation of
the ISI is given by $\kappa\theta$. Because $\kappa$ is fixed, $\theta$ is the
input to the neuron.



Due to refractoriness, a neuron cannot fire too
fast; therefore the ISI cannot be 0 but must be
larger than a few milliseconds. On the other hand,
if the ISI is too large, it means the neuron is not
working. Thus we assume the input to the neuron
is trying to control the ISI in a fixed range of time.

The average ISI, which depends on $\theta$ and $\kappa$, is
limited between $a_0$ and $b_0$, that is,

$$a_0 \le \bar{T} = \kappa\theta \le b_0, \quad \text{where } 0 < a_0 < b_0 < \infty.$$

Thus, $\theta$ is bounded in $\Theta(\kappa) = \{\theta \mid a_0/\kappa \le \theta \le
b_0/\kappa\}$.
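A minimal simulation of this gamma ISI model is sketched below, treating $\kappa$ as the shape parameter and $\theta$ as the scale parameter so that the mean ISI is $\kappa\theta$. The bounds of 5 ms and 50 ms on the mean ISI, the choice $\kappa = 3$ and the function names are assumptions made only for illustration and are not taken from the text.

```python
import random


def theta_range(a0, b0, kappa):
    """Admissible interval for theta when the mean ISI kappa * theta lies in [a0, b0]."""
    return a0 / kappa, b0 / kappa


def sample_isis(theta, kappa, n, rng):
    """Draw n inter-spike intervals from a gamma distribution.

    kappa is the shape and theta the scale, so the expected ISI is kappa * theta.
    """
    return [rng.gammavariate(kappa, theta) for _ in range(n)]


if __name__ == "__main__":
    kappa = 3.0
    a0, b0 = 0.005, 0.050            # hypothetical bounds on the mean ISI, in seconds
    theta_min, theta_max = theta_range(a0, b0, kappa)
    rng = random.Random(0)           # fixed seed so the run is repeatable
    isis = sample_isis(theta_min, kappa, 20000, rng)
    print(theta_min, theta_max)
    print(sum(isis) / len(isis))     # should be close to a0 = kappa * theta_min
```

Holding the input $\theta$ constant, as here, gives ISIs that still vary from spike to spike; that variability is what plays the role of channel noise in this model.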

4.2 Channel Capacity of a Single Neuron

For a noisy channel, one important fundamental 
problem is to compute the capacity C. Another 
problem is to obtain the capacity achieving distri- 
bution. 

The family of all the possible distributions $\pi(\theta)$
of inputs, denoted $\mathcal{P}$, is defined as

$$\mathcal{P} = \{\pi \mid \pi(\theta) \ge 0 \text{ for } \theta \in \Theta(\kappa),\ 0 \text{ otherwise}\}.$$

The mutual information and the capacity depend
on the choice of an output variable. This is
called "coding" in computational neuroscience, but
"modulation" is an appropriate term in information
theory. Traditionally, two types of modulation
have been considered in computational neuroscience.
One is "temporal coding" and the other is
"rate coding" (see Fig. 3). Note that both temporal
and rate coding may be used in the brain. For
example, binaural sound localisation needs phase
information and temporal coding seems natural,
while rate coding is appropriate for a motor neuron
because muscles react according to the rate.

We provide some results on the capacity of tem- 
poral and rate coding in the following. 

Temporal Coding 

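In temporal coding, the received information is $T$.
For a $\pi \in \mathcal{P}$, we define the marginal distribution
as

$$p(t;\pi,\kappa) = E_{\pi(\theta)}\left[p(t|\theta;\kappa)\right].$$

The mutual information between $T$ and $\Theta$ is defined
as

$$I(\Theta;T) = E_{\pi(\theta)}\left[E_{p(T|\theta;\kappa)}\left[\log \frac{p(t|\theta;\kappa)}{p(t;\pi,\kappa)}\right]\right].$$

The capacity per spike is defined as

$$C_T = \sup_{\pi \in \mathcal{P}} I(\Theta;T).$$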
(a) Temporal coding. (b) Rate coding.

Figure 4: Capacity achieving distributions for temporal
and rate coding. For both coding types, the
optimal input distribution is discrete, with a finite
number of probability mass points.



This optimisation problem cannot be solved analytically.
However, it has been proven that
the capacity $C_T$ is achieved by a discrete distribution
with a finite number of mass points
(see Ikeda and Manton (2009) for the details).

Since the optimal distribution is a discrete distribution
with a finite number of mass points, the
optimisation problem becomes simple, and we can
compute the capacity and the capacity achieving
distribution numerically. Figure 4a shows the capacity
achieving probability distribution for $\kappa = 3$.
The channel capacity $C_T$ is 34.68 bps (bits per second)
(see Ikeda and Manton (2009) for further results).
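To give a feel for the numerical side, the sketch below evaluates the temporal-coding mutual information $I(\Theta;T)$ for one candidate discrete input distribution, using a gamma density with shape $\kappa$ and scale $\theta$ and a simple midpoint rule for the integral over $t$. It is not the procedure of Ikeda and Manton (2009): the mass points, weights, integration limit and function names are hypothetical, and the outer search over mass points and weights that a capacity calculation requires is not shown.

```python
import math


def gamma_pdf(t, kappa, theta):
    """Gamma density with shape kappa and scale theta, evaluated at t."""
    if t <= 0:
        return 0.0
    log_pdf = ((kappa - 1) * math.log(t) - t / theta
               - math.lgamma(kappa) - kappa * math.log(theta))
    return math.exp(log_pdf)


def temporal_mutual_information(thetas, weights, kappa, t_max, steps=20000):
    """I(Theta; T) in bits per spike for a discrete input distribution.

    thetas are the mass points and weights their probabilities. The integral
    over the ISI t is approximated by the midpoint rule on (0, t_max), so
    t_max must be large enough to cover the tails of every conditional density.
    """
    dt = t_max / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt
        conditionals = [gamma_pdf(t, kappa, th) for th in thetas]
        marginal = sum(w * c for w, c in zip(weights, conditionals))
        for w, c in zip(weights, conditionals):
            if w > 0 and c > 0:
                total += w * c * math.log2(c / marginal) * dt
    return total


if __name__ == "__main__":
    # Two hypothetical, equally likely input states in arbitrary time units.
    print(temporal_mutual_information([1.0, 10.0], [0.5, 0.5], kappa=3.0, t_max=300.0))
```

A capacity estimate would wrap this function in an optimiser over the mass points (restricted to the admissible range of $\theta$) and their weights; with M mass points the result can never exceed $\log_2 M$ bits per spike.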

Figure 4a shows that the capacity is achieved
when the input is a discrete memoryless distribution
with 3 states. This does not imply that the
brain is using discrete states. It is more plausible
that the brain is using continuous states; it is
likely that the actual information transmission rate
in the brain is less than the numerically computed
capacity.



Rate Coding 

In rate coding, a time window is set and the number
of spikes in this interval is counted. Let us
denote the interval and the rate as $\Delta$ and $R$,
respectively, and define the distribution of $R$ as
$p(r|\theta;\kappa,\Delta)$. The form of the distribution of $R$ is
shown in Ikeda and Manton (2009). For $\pi \in \mathcal{P}$, let
us define the following marginal distribution

$$p(r;\pi,\kappa,\Delta) = E_{\pi(\theta)}\left[p(r|\theta;\kappa,\Delta)\right].$$

The mutual information of $R$ and $\Theta$ is defined as

$$I(\Theta;R) = E_{\pi(\theta)}\left[E_{p(R|\theta;\kappa,\Delta)}\left[\log \frac{p(r|\theta;\kappa,\Delta)}{p(r;\pi,\kappa,\Delta)}\right]\right].$$

Hence, the capacity per channel use or equivalently
per $\Delta$ is defined as

$$C_R = \sup_{\pi \in \mathcal{P}} I(\Theta;R).$$



^The input is chosen from the three states independently
at each time according to the probability distribution shown
in Fig. 4a.



This optimisation problem cannot be solved
analytically either, but the capacity $C_R$ has
been proven to be achievable by a discrete
distribution with a finite number of mass
points (Ikeda and Manton, 2009). Figure 4b shows
the capacity achieving distribution for $\kappa = 3$.
The channel capacity $C_R$ is 44.95 bits per second
(see Ikeda and Manton (2009) for further results).



4.3 Tuning Curves 

The definition of channel capacity requires a max- 
imisation over all possible input probability distri- 
butions. This definition arose in an engineering 
context, where a system designer is assumed to 
have control over the inputs to the channel, but not 
the channel itself. A different optimisation prob- 
lem results if the input to the channel is assumed to 
be fixed, but some control over the channel is pos- 
sible. This idea is particularly relevant for studies 
of biological sensory transduction. In this context, 
an external stimulus that cannot be controlled by 
the sensing organism must be transduced and en- 
coded into action potentials for communication to 
the brain. This stimulus can be thought of as an 
input to a communication channel. 

Given internal noise in the transduction mechanisms,
the encoded stimulus received by the brain
is also noisy. Since we introduced mutual information
in the context of the channel coding theorem
and digital data, and here our channel input
is a sensory stimulus, mutual information may
not seem relevant. However there are other reasons
why it can be useful to ensure mutual information
is as large as possible (Berger, 2003;
Johnson and Goodman, 2008) and we therefore are
interested in how the channel might be altered to
maximise mutual information.

But what can be optimised when the stimulus 
cannot be controlled? The only other variable that 
can alter the mutual information is the conditional 
distribution of the channel output given its input. 
In the biological context this is the distribution of 
a neural response for a given stimulus. Optimising 
this distribution means seeking to find the communication
channel that is best suited to a fixed
stimulus distribution. Without some constraints 
on the form of the conditional distribution, this 
would not be a meaningful task. One such con- 
straint is to consider a fixed form for the condi- 
tional distribution that has some parameters that 
can be optimised. Clearly the optimal parameter 
set may change for different input distributions. 

One reason for considering such an optimisation
might be to assess whether neuronal mechanisms
exist for adaptively altering the conditional distribution
to match non-stationary stimuli. Another
equally intriguing reason is the idea that evolution
might have enabled neural systems to change
parameters over eons with the end result that
those parameters are information theoretically optimal.
In this scenario, the underlying fixed form
of the probability distribution must be what it is
due to unavoidable constraints, or perhaps governed
by criteria that are not information theoretic,
e.g. energy considerations (Laughlin et al.,
1998; Sarpeshkar, 1998; Levy and Baxter, 2002;
Laughlin and Sejnowski, 2003; Berger and Levy,
2010).


There are several potentially important sets of 
parameters that might be chosen. However, in 
other contexts there has been much interest in 
determining the optimal form of neuronal tuning 
curves, and this is our sole focus here. Experimentally
produced tuning curves are plots of the
mean response of a neural system, as a function
of a stimulus parameter (Dayan and Abbott, 2001;
Lansky et al., 2008; McDonnell and Stocks, 2008).
Classic examples of a stimulus parameter include 
the angle of a moving bar of light relative to the 
receptive field of visual cells, or the sound pres- 
sure level of a single frequency (pure tone) sound 
played by a speaker. In these examples the av- 
erage response as a function of the stimulus de- 
fines the tuning curve. The former kind of tun- 
ing curve typically has a bell-shape, meaning that 
there is a stimulus that produces a maximal re- 
sponse, while more than one stimulus can produce 
the same lesser response. The latter kind has a 
sigmoidal shape, meaning that the mean firing rate 
monotonically increases with stimulus, and here we 
focus only on this case. 

We therefore wish to find the sigmoidal tun- 
ing curve that maximises mutual information, for 
a given fixed form of a conditional distribution. 
While this is generally a difficult optimisation 
problem, a simple solution exists for channels 
where the capacity achieving input distribution is 
discrete, like those considered in this paper. In 
fact, the mutual information maximising tuning 
curve can be derived for an arbitrary stimulus dis- 
tribution, if the capacity achieving input distribu- 
tion has been calculated first. The reason for this 
is explained in the following. 

We consider the same noisy channel as Section 4.2,
that is, the ISIs are governed by a gamma
distribution. We now make a slight generalisation
to the setup of Sections 4.1 and 4.2 and consider
the expectation of the ISIs to be governed by a random
variable $X$ with a known distribution, such
that $\Theta$ is an arbitrary function of $X$.

We therefore write $\Theta = f(X)$. For the gamma
ISI channel, the tuning curve is defined as the conditional
expectation of the response variable (either
$T$ or $R$) given a specified outcome of $X$. For
temporal coding or rate coding respectively, we write
these expectations as



and 



Ep{T\x;K)[t] = K.f{x) 



Ep{R\x-K,A)M 



Kf{x) 
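The two conditional expectations can be checked
numerically. The following is a minimal sketch, not
part of the original derivation. It assumes that
$\kappa$ is the shape parameter and $\theta = f(x)$
the scale parameter of the gamma distribution, and
that the rate variable counts spikes in a window of
length $\Delta$. The function name and the parameter
values are illustrative only.

```python
import numpy as np

def simulate_gamma_isi(kappa, theta, window, n_isi, seed=0):
    """Simulate a renewal spike train with gamma ISIs (shape kappa, scale theta).

    Returns the sample mean ISI and the number of spikes in [0, window].
    """
    rng = np.random.default_rng(seed)
    isis = rng.gamma(shape=kappa, scale=theta, size=n_isi)
    spike_times = np.cumsum(isis)
    if spike_times[-1] < window:
        raise ValueError("n_isi too small to cover the window")
    # Number of spike times that are <= window.
    count = int(np.searchsorted(spike_times, window, side="right"))
    return isis.mean(), count

# Illustrative values only (assumptions, not taken from the paper).
kappa, theta, window = 2.0, 0.05, 1000.0
mean_isi, count = simulate_gamma_isi(kappa, theta, window, n_isi=20000)
print(mean_isi, kappa * theta)           # sample mean ISI vs. kappa * theta
print(count, window / (kappa * theta))   # spike count vs. window / (kappa * theta)
```

For a long window the spike count should be close to
$\Delta/(\kappa\theta)$, while the sample mean ISI
should be close to $\kappa\theta$.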



Since $\theta = f(X)$, when the capacity achieving
distribution for $\theta$ is discrete with, say, M
states, the tuning curve will also consist of M
unique values. Let us call these
$\mu_1, \ldots, \mu_M$. While this discontinuous
tuning curve achieves the largest possible mutual
information for the channel, it is not unique; other
tuning curves are equally good. Suppose X is a
continuous random variable, and that the tuning
curve maps large intervals of X to the same M
values $\mu_1, \ldots, \mu_M$. This tuning curve
provides the same M possibilities for the
conditional expected ISI. In order for it to provide
the same mutual information achieved by the original
discrete capacity achieving distribution, it is
necessary that the probability with which each
$\mu_m$ occurs is the same in both cases. It has
been proven for a special case of the gamma
distribution and rate coding that this can be
achieved by appropriate choice for the ranges of X
that are mapped to each $\mu_m$. The resulting
optimal discrete tuning curve is then dependent only
on the probability density function of X
(Nikitin et al., 2009).
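The construction described above can be sketched as
follows. This is a minimal illustration under stated
assumptions, not the procedure of Nikitin et al.
(2009): the levels and their probability masses are
taken as given, the stimulus distribution is
supplied through its quantile function (inverse
CDF), and the levels are assigned to consecutive
intervals of X in the order listed. The function
name `step_tuning_curve` and all numbers are
hypothetical.

```python
import numpy as np

def step_tuning_curve(levels, probs, quantile):
    """Build a piecewise-constant tuning curve f(x).

    levels   : the M tuning-curve values, listed in the order in which
               they should be assigned to increasing x
    probs    : probability mass the capacity achieving distribution
               puts on each level (must sum to 1)
    quantile : inverse CDF of the stimulus X (hypothetical callable)
    """
    levels = np.asarray(levels, dtype=float)
    probs = np.asarray(probs, dtype=float)
    if levels.shape != probs.shape or not np.isclose(probs.sum(), 1.0):
        raise ValueError("levels and probs must match and probs must sum to 1")
    # Interior boundaries: x-values at which the stimulus CDF equals the
    # cumulative mass of the levels assigned so far.
    boundaries = np.array([quantile(c) for c in np.cumsum(probs)[:-1]])

    def f(x):
        # Index of the interval that x falls into, then look up its level.
        return levels[np.searchsorted(boundaries, x, side="right")]

    return f, boundaries

# Illustrative numbers only: three levels, X uniform on [0, 1]
# (for which the quantile function is the identity).
f, boundaries = step_tuning_curve([5.0, 15.0, 30.0], [0.5, 0.3, 0.2], lambda c: c)
x = (np.arange(1000) + 0.5) / 1000          # evenly spaced stimulus values
values, counts = np.unique(f(x), return_counts=True)
print(boundaries)                            # interval boundaries in x
print(dict(zip(values, counts / x.size)))    # fraction of x mapped to each level
```

Because the boundaries are placed at quantiles of X,
each level occurs with its prescribed probability
whatever the stimulus distribution is; only the
positions of the steps change.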



Consequently, the capacity achieving input
distributions derived in Section 4.3 can be
converted to an optimal tuning curve for any choice
of the stimulus distribution. An example of the
capacity achieving tuning curve is shown in Figure 5
for the special case of $\kappa = 1$, which means
the channel is equivalent to a Poisson neuron
(Nikitin et al., 2009). The maximum rate is
restricted to 30 spikes per input sample.

Although such a result holds exactly only for
discrete input distributions, similar derivations of
information theoretically optimal continuous tuning
curves have been made, which hold only in the low
noise limit (McDonnell and Stocks, 2008). See
Kostal and Lansky (2010) and Kostal (2010) for
related work in the high noise limit.

4.4 Discussion and Interpretation 

We have shown our results on neuron channel
capacity from two very different viewpoints.
Interestingly, both show that the capacity is
achieved by a discrete distribution. The numerically
computed capacities are similar to the range
indicated by some biologically measured results of
sensory neurons (Borst and Theunissen, 1999). The
channel capacity depends on various factors, and we
consider some of them below.

Figure 5: Channel capacity achieving tuning curve
for a Poisson rate-coding neuron ($\kappa = 1$), and
an input $X \in [0, 1]$, with a maximum spike-rate
of 30 spikes per input sample; see Nikitin et al.
(2009) for further examples.

4.5 Input and Output 

We first discuss the input and output of the neuron 
channel in this subsection. 

Let us start with the input. Although each neuron
receives information from many neurons, we have
only considered a single input $\theta$. This may
seem too simple. We assumed that the single input
$\theta$ summarises all the inputs to the neuron.
Moreover, $\theta$ has been assumed to be memoryless
and can have any distribution within the support.
Considering the biological system, this is far from
realistic. The net input of a neuron may not change
quickly, that is, it has memory. Moreover, a
neuron's input is a collection of many neurons'
noisy outputs; therefore, it may follow a particular
probability distribution. This implies that we have
computed capacity under less restrictive
assumptions, and the biologically achievable rate
should be smaller than the capacity obtained in the
numerical studies. A better understanding of the
constraints on the input of a neuron would lead
to a more accurate calculation of neuronal channel
capacity.

Next, we discuss the output of a neuron from 
two viewpoints, decoding and demodulation. 

In order to achieve channel capacity, the receiver
must act as an optimal decoder, meaning that when
the output $\tilde{x}$ is observed, the receiver
must compute the posterior distribution of the input
X as $p(x|\tilde{x})$. When x takes discrete values
$x_1, \ldots, x_L$, it becomes $p(x_i|\tilde{x})$.
This is a real number for each $x_i$. In
engineering, this type of decoding is called
"soft decoding." It seems unlikely that a neuronal
mechanism for carrying this out could exist, since
a computation of the posterior distribution is
necessary.

Another standard decoding technique is "hard
decoding," that is, only a single value of $x_i$ is
chosen by the decoder. The optimal hard decoder
chooses the one which maximises the posterior
distribution, that is
$\hat{x} = \arg\max_{x_i} p(x_i|\tilde{x})$. It is
possible to implement the optimal hard decoder
without requiring the online computation of the
posterior distribution. Indeed, the output space
(the space where $\tilde{x}$ lies) can be divided
into L subspaces in advance, such that the k-th
subspace contains all the points $\tilde{x}$ for
which $x_k$ has the largest posterior probability
given $\tilde{x}$. Therefore, the optimal hard
decoder can be implemented by a "quantisation" or
"thresholding" algorithm which simply checks to see
which subspace $\tilde{x}$ lies in. Such an
algorithm is often computationally simpler than
first computing the posterior distribution.
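The thresholding idea can be illustrated with a
small sketch. It assumes, purely for illustration, a
Poisson spike-count output and three equiprobable
input rates; none of these numbers come from the
paper, and the function names are hypothetical. The
decision regions are computed once in advance, after
which decoding is a table lookup rather than a
posterior computation.

```python
import numpy as np

def map_scores(r, rates, prior):
    """Log-posterior of each input rate given spike count r, up to a constant.

    For a Poisson likelihood, log p(r | rate) = r*log(rate) - rate - log(r!),
    and the log(r!) term is the same for every input, so it is dropped.
    """
    return np.log(prior) + r * np.log(rates) - rates

def build_hard_decoder(rates, prior, r_max):
    """Precompute, for each count 0..r_max, the index of the MAP input."""
    rates = np.asarray(rates, dtype=float)
    prior = np.asarray(prior, dtype=float)
    return np.array([int(np.argmax(map_scores(r, rates, prior)))
                     for r in range(r_max + 1)])

# Illustrative values only: three equiprobable input rates.
rates = [2.0, 10.0, 25.0]
prior = [1 / 3, 1 / 3, 1 / 3]
table = build_hard_decoder(rates, prior, r_max=40)

# Decoding is now a table lookup (equivalently, two threshold comparisons).
thresholds = np.flatnonzero(np.diff(table)) + 1   # first count of each new region
print(thresholds)                     # counts at which the decision changes
print(table[3], table[12], table[30]) # decoded input index for three counts
```

In this example the decision regions are contiguous
ranges of the spike count, so the whole decoder
reduces to comparing the observed count against a
small number of thresholds.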

When we consider information processing in the 
brain, a naive soft decoding seems difficult, at 
least for a single neuron. The hard decoding 
idea seems more natural. However, the informa- 
tion transmission rate of the best hard decoding is 
less than that of the best soft decoding. We should
keep this point in mind. Further discussion is found
in Ikeda and Manton (2009).



4.6 Discreteness 

Under the assumptions made, we have shown that
the capacity of a neuron is achieved by a discrete
distribution with a finite number of probability
mass points. In information theory, there are many
other known types of channels for which channel
capacity is achieved by a discrete distribution
(Huang and Meyn, 2005). For example, although the
capacity of an average power constrained additive
white Gaussian noise channel is achieved by
continuously valued Gaussian inputs, simply placing
a constraint on the maximum amplitude of the channel
input means capacity is achieved by a discrete
distribution (Smith, 1971).
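Capacity and a capacity achieving input distribution
can be computed numerically for any channel with
finitely many inputs and outputs. The sketch below
uses the Blahut-Arimoto algorithm, a standard
method; it is offered as an illustration and is not
necessarily the method used for the numerical
results in this paper. The function name, tolerance
and test channel are assumptions.

```python
import numpy as np

def blahut_arimoto(P, tol=1e-9, max_iter=10000):
    """Capacity of a discrete memoryless channel.

    P[i, j] = p(output j | input i); each row must sum to 1.
    Returns (capacity in bits, capacity achieving input distribution).
    """
    P = np.asarray(P, dtype=float)
    q = np.full(P.shape[0], 1.0 / P.shape[0])   # start from a uniform input
    for _ in range(max_iter):
        out = q @ P                              # output marginal under q
        with np.errstate(divide="ignore", invalid="ignore"):
            log_ratio = np.where(P > 0, np.log(P / out), 0.0)
        D = np.sum(P * log_ratio, axis=1)        # KL(P[i] || out), in nats
        lower = np.log(np.sum(q * np.exp(D)))    # lower bound on capacity
        upper = np.max(D)                        # upper bound on capacity
        q = q * np.exp(D)                        # multiplicative update
        q /= q.sum()
        if upper - lower < tol:
            break
    return lower / np.log(2), q

# Sanity check on a binary symmetric channel with crossover probability 0.1,
# whose capacity is known in closed form: 1 - H(0.1), about 0.531 bits.
bsc = [[0.9, 0.1], [0.1, 0.9]]
capacity, q = blahut_arimoto(bsc)
print(round(capacity, 3), q)
```

Applied to a finely discretised version of a channel
with a bounded input, the returned input
distribution can be inspected to see whether its
mass concentrates on a small number of points.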



Our results do not imply that "neuron signals 
are only using discrete levels." On the contrary, 
we believe many neurons are using continuous lev- 
els. What is implied by our results is that those 
neurons cannot achieve the capacity and the ac- 
tual rate of information transfer is therefore less 
than the capacity. Another implication is for the 
measurement of information capacity in
neuroscience (Borst and Theunissen, 1999). Our result
implies that only a small number of discrete ranges 
are sufficient for the input distribution to measure 
the information capacity of neural coding. 

Information theory provides a way of quantifying
whether discrete or continuously distributed signals
are better for any given communication channel
corrupted by random fluctuations. Many neuroscience
studies quantify information using Shannon's famous
information capacity formula relating mutual
information to signal-to-noise ratio. This formula
is correct only for Gaussian additive noise channels
(Cover and Thomas, 2006); it relies on many
assumptions, and if any are false it can
significantly under- or over-estimate the true
capacity (Berger and Gibson, 1998).
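For reference, the formula in question is usually
stated in the following standard textbook forms. The
symbols P (input power), $\sigma^2$ (noise
variance), W (bandwidth) and $N_0$ (noise power
spectral density) are introduced here for
illustration and are not used elsewhere in this
section.

```latex
% Discrete-time channel Y = X + Z, with Z ~ N(0, \sigma^2) independent of X
% and an average power constraint E[X^2] \le P on the input:
C = \frac{1}{2} \log_2\!\left(1 + \frac{P}{\sigma^2}\right)
    \quad \text{bits per channel use}

% Band-limited channel with bandwidth W and white Gaussian noise of
% two-sided power spectral density N_0 / 2:
C = W \log_2\!\left(1 + \frac{P}{N_0 W}\right)
    \quad \text{bits per second}
```

Both expressions assume additive Gaussian noise that
is independent of the input, and a constraint on
average power only; the capacity is then achieved by
a Gaussian input.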



4.7 Further Points 

Guessing the channel model is not straightforward.
In particular, feedback is prevalent in many parts
of the brain, which makes it much more difficult
to relate changes in responses to inputs.[*]
One important example where it is known that
the circuitry is solely feedforward is that of
retinal cells (Jacobs et al., 2009). This example
has been used to demonstrate that it is possible to
rule out certain guesses of neural codes. Because
this is based on optimal Bayesian decoding of a
forced binary choice, it would be interesting to
extend Jacobs et al. (2009) beyond the binary
limitation to the case where the signals being coded
may have many possibilities.

What can we say about more complicated
situations? For example, information processing in
cortical neurons is not strictly feedforward and
the information is shared by many neurons. If we
assume every neuron is performing the same com- 
putation, and each neuron encodes and decodes 
information in the same way, it seems possible to 
extend our results. However, when different neu- 
rons encode information differently, the problem 
becomes very different. Understanding how the 
brain works in information theoretic terms is one 
of the grand challenges of this century. 

5 Conclusions and Future Directions

Most information theory research to date has 
been predicated on an engineering viewpoint. 
The main thrust has been to design compres- 
sion/decompression methods that compress source 
information to the limits given by its entropy 
(source coding), or to design error correction 
schemes and encoding/decoding methods that al- 
low communication close to capacity. 

On the other hand, the goal of neuroscience is in- 
stead to understand information processing in the 
brain. 

This does not mean that information theory has 
no place in computational neuroscience. Informa- 
tion theory provides a way to measure informa- 
tion and to understand the limits of compression 
(namely, entropy) and communication (namely, 
mutual information and channel capacity). 

A common "information theoretic" method in 
computational neuroscience is to obtain quantita- 
tive estimates of the mutual information between 
observed sets of data. Since various methods for
the (difficult) problem of accurately estimating
mutual information exist, the bigger difficulty is 
with using the result to say something about how 
the system works. Indeed, the actual goal of com- 
putational neuroscience is that of "system identifi- 
cation," as engineers would call it. 
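To make the estimation step concrete, the following
is a minimal sketch of the simplest ("plug-in")
estimator of mutual information from paired discrete
observations. It is one of many possible estimators
and is not attributed to any of the cited works; the
function name and the toy data are assumptions.

```python
import numpy as np

def plugin_mutual_information(x, y):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    x = np.asarray(x).ravel()
    y = np.asarray(y).ravel()
    if x.size != y.size:
        raise ValueError("x and y must have the same number of samples")
    # Map each distinct symbol to an integer index.
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    # Empirical joint distribution from co-occurrence counts.
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)    # marginal of X, column vector
    py = joint.sum(axis=0, keepdims=True)    # marginal of Y, row vector
    nz = joint > 0                           # skip empty cells (0 log 0 = 0)
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))

# Toy data: Y identical to X (four equiprobable symbols), and a pair
# constructed so that every (x, y) combination occurs equally often.
x_same = np.tile(np.arange(4), 25)
mi_same = plugin_mutual_information(x_same, x_same)
x_ind = np.repeat(np.arange(4), 4)
y_ind = np.tile(np.arange(4), 4)
mi_ind = plugin_mutual_information(x_ind, y_ind)
print(mi_same, mi_ind)
```

With limited data the plug-in estimate is biased
upwards, which is one reason why accurate estimation
is regarded as a difficult problem.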

If the brain uses spikes to transmit information 
(which on faster time-scales appears to be the case) 
then understanding the neural code — how the 
brain encodes information before sending it across 
the "channel" — is tantamount to understanding 
how the brain works. Indeed, if we could listen in
to the messages as they are sent from one neuron
to another, it would be relatively straightforward 
to determine what each neuron is doing. 

Although system identification is not the forte
of traditional information theory (Johnson, 2008),
this is merely for historical reasons. With
computational neuroscience as a main motivator, we
predict that the next decade will see the expansion of
information theory to include more powerful tech- 
niques for system identification, and possibly even 
an integration of control, computation and infor- 
mation theory into a unified framework. 

Some recent information theoretic approaches 
in neuroscience that go beyond standard Shannon 
theory are summarized in the following list. 

• While it is traditional in engineering to
separately code an information source, and then
channel code it for communication across
a channel, it has been shown for some
simple but instructive examples that
non-separation of these two aspects can achieve
an optimal communication system with a
vastly reduced complexity compared to
separation (Gastpar et al., 2003). This fact is
potentially important for neurobiological
systems, where separation mechanisms seem
implausible.

• Many studies have investigated whether the
brain might have mechanisms for implementing
Bayesian algorithms during decision making,
prediction and pattern recognition
(Rao and Ballard, 1999; Lee and Mumford, 2003;
Knill and Pouget, 2004; George and Hawkins, 2009).



• The possibility of analog cortical error
correcting codes has been proposed by Fiete et al.
(2008).



[*] Channels with feedback can have larger capacity.



• One limitation of Shannon theory is that
its measures say nothing about directionality
or causality. However, directed information
theory has also been developed (Granger, 1969;
Marko, 1973; Rissanen and Wax, 1987;
Massey, 1990; Tatikonda and Mitter, 2009)
and has recently been applied in neuroscience
(Hesse et al., 2003; Eichler, 2006;
Waddell et al., 2007; Amblard and Michel, 2011;
Quinn et al., 2011).


• The relationship between control, information
theory and thermodynamics has been discussed
by Mitter and Newton (2005); Friston (2010);
Mitter (2010).



Summarising our own modest contribution, we
carefully constructed a simple neuron channel
model based on biological evidence and computed
the channel capacity of this model. Interestingly,
it was proved that the channel capacity is achieved
by a discrete distribution.